SPDF: Sparse Pre-training and Dense Fine-tuning for Large Language Models
The pre-training and fine-tuning paradigm has contributed to a number of
breakthroughs in Natural Language Processing (NLP). Instead of directly
training on a downstream task, language models are first pre-trained on large
datasets with cross-domain knowledge (e.g., Pile, MassiveText, etc.) and then
fine-tuned on task-specific data (e.g., natural language generation, text
summarization, etc.). Scaling the model and dataset size has helped improve the
performance of LLMs, but unfortunately, this also leads to highly prohibitive
computational costs. Pre-training LLMs often requires orders of magnitude more
FLOPs than fine-tuning, and the model capacity often remains the same between
the two phases. To achieve training efficiency w.r.t. training FLOPs, we propose
to decouple the model capacity between the two phases and introduce Sparse
Pre-training and Dense Fine-tuning (SPDF). In this work, we show the benefits
of using unstructured weight sparsity to train only a subset of weights during
pre-training (Sparse Pre-training) and then recover the representational
capacity by allowing the zeroed weights to learn (Dense Fine-tuning). We
demonstrate that we can induce up to 75% sparsity into a 1.3B-parameter GPT-3
XL model, resulting in a 2.5x reduction in pre-training FLOPs, without a
significant loss in accuracy on the downstream tasks relative to the dense
baseline. By rigorously evaluating multiple downstream tasks, we also establish
a relationship between sparsity, task complexity and dataset size. Our work
presents a promising direction to train large GPT models at a fraction of the
training FLOPs using weight sparsity, while retaining the benefits of
pre-trained textual representations for downstream tasks.
Comment: Accepted to Uncertainty in Artificial Intelligence (UAI) 2023
Conference; 13 pages, 4 figures (Main Paper) + 5 pages (Supplementary
Material)
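
As a rough illustration of the decoupling described above, the following sketch
applies a fixed unstructured sparsity mask to every linear layer for
pre-training and later removes the masks so the zeroed weights can learn during
fine-tuning. This is a minimal PyTorch approximation, not the paper's
implementation; the generic model, the 75% sparsity level, and the use of
torch.nn.utils.prune are assumptions made purely for illustration.

import torch.nn as nn
import torch.nn.utils.prune as prune

def sparsify_for_pretraining(model: nn.Module, sparsity: float = 0.75) -> nn.Module:
    # Sparse Pre-training (sketch): mask out a random subset of weights in
    # every linear layer; masked weights receive zero gradient, so only the
    # surviving subset of weights is trained.
    for module in model.modules():
        if isinstance(module, nn.Linear):
            prune.random_unstructured(module, name="weight", amount=sparsity)
    return model

def densify_for_finetuning(model: nn.Module) -> nn.Module:
    # Dense Fine-tuning (sketch): drop the pruning reparameterization so the
    # previously zeroed weights become ordinary trainable parameters again,
    # recovering full representational capacity for the downstream task.
    for module in model.modules():
        if isinstance(module, nn.Linear) and prune.is_pruned(module):
            prune.remove(module, "weight")
    return model

In this sketch the mask is random and static; the paper's actual sparsity
pattern, hardware, and training schedule may differ.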
Towards Understanding Label Regularization for Fine-tuning Pre-trained Language Models
Knowledge Distillation (KD) is a prominent neural model compression technique
which heavily relies on teacher network predictions to guide the training of a
student model. Considering the ever-growing size of pre-trained language models
(PLMs), KD is often adopted in NLP tasks involving PLMs. However, deploying the
teacher network during training adds to the memory and computational
requirements of training the student. In the computer vision literature, the
necessity of the teacher network has been put under scrutiny by work showing
that KD is a label regularization technique that can be replaced with lighter
teacher-free variants such as label smoothing. However,
to the best of our knowledge, this question has not been investigated in NLP.
This work therefore studies different label regularization techniques and asks
whether we actually need the teacher's labels to fine-tune smaller PLM student
networks on downstream tasks. To this end, we conducted a comprehensive set of
experiments on different PLMs such as BERT, RoBERTa, and GPT, with more than
600 distinct trials, running each configuration five times. This investigation
led to the surprising observation that KD and other label regularization
techniques do not play any meaningful role beyond regular fine-tuning when the
student model is pre-trained. We further explore this phenomenon in different
NLP and computer vision tasks and demonstrate that pre-training itself acts as
a form of regularization, making additional label regularization unnecessary.
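
To make the comparison above concrete, the following sketch contrasts a
standard KD objective (softened teacher targets combined with hard-label
cross-entropy) with the teacher-free label-smoothing alternative. It is a
minimal PyTorch sketch; the names student_logits, teacher_logits, and labels,
as well as the temperature and weighting defaults, are placeholder assumptions
rather than values taken from the paper.

import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T: float = 2.0, alpha: float = 0.5):
    # Knowledge distillation: KL divergence to the temperature-softened
    # teacher distribution, mixed with ordinary cross-entropy on hard labels.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

def label_smoothing_loss(student_logits, labels, eps: float = 0.1):
    # Teacher-free label regularization: cross-entropy against smoothed
    # one-hot targets, requiring no teacher forward pass during training.
    return F.cross_entropy(student_logits, labels, label_smoothing=eps)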
Derivation of Mouse Haploid Trophoblast Stem Cells
Summary: Trophoblast stem (TS) cells are increasingly used as a model system for studying placentation and placental disorders. However, practical limitations of genetic manipulation have posed challenges for genetic analysis using TS cells. Here, we report the generation of mouse parthenogenetic haploid TS cells (haTSCs) and show that supplementation with FGF4 and inhibition of Rho-associated protein kinase (ROCK) enable the maintenance of their haploidy and developmental potential. The resulting haTSCs have 20 chromosomes, exhibit typical expression features of TS cells, possess the multipotency to differentiate into specialized trophoblast cell types, and can chimerize E13.5 and term placentas. We also demonstrate the capability of the haTSCs to undergo genetic manipulation and facilitate genome-wide screening in the trophoblast lineage. We expect that haTSCs will offer a powerful tool for studying functional genomics and placental biology.
In brief: Cui et al. report the generation of mouse haploid TS cells, which possess a wide extraembryonic developmental potential and can serve as a powerful tool for studying functional genomics and placental biology.
Keywords: haploidy, trophoblast, stem cells, TS
Loss-of-Function of p21-Activated Kinase 2 Links BMP Signaling to Neural Tube Patterning Defects
Abstract: Closure of the neural tube is a highly complex and coordinated process, the failure of which results in common birth defects. The serine/threonine kinase p21-activated kinase 2 (PAK2) is a critical regulator of cytoskeleton dynamics; however, its role in neurulation and in the pathogenesis of neural tube defects (NTDs) remains unclear. Here, the results show that Pak2-/- mouse embryos fail to develop dorsolateral hinge points (DLHPs) and exhibit craniorachischisis, a severe phenotype of NTDs. Pak2 knockout activates BMP signaling, which is involved in vertebrate bone formation. Single-cell transcriptomes reveal abnormal differentiation trajectories and transcriptional events in Pak2-/- mouse embryos during neural tube development. Two nonsynonymous mutations and one recurrent splice-site mutation in the PAK2 gene are identified in five human NTD fetuses, which exhibit attenuated PAK2 expression and upregulated BMP signaling in the brain. Mechanistically, PAK2 regulates Smad9 phosphorylation to inhibit BMP signaling and ultimately induce DLHP formation. Depletion of pak2a in zebrafish induces defects in the neural tube, which are partially rescued by overexpression of wild-type, but not mutant, PAK2. These findings demonstrate the conserved role of PAK2 in neurulation across multiple vertebrate species, highlighting the molecular pathogenesis of PAK2 mutations in NTDs.